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Abstract 


As known, outliers and multicollinearity in the data set are among the 
important difficulties in regression models, which badly affect the least- 
squares estimators. Under multicollinearity and outliers’ existence in the 
data set, the prediction performance of the least-squares regression method 
is decreased dramatically. Here, proposing an approximation for the con- 
dition number, we suggest a nonlinear mixed-integer programming model 
to simultaneously control inappropriate effects of the mentioned problems. 
The model can be effectively solved by popular metaheuristic algorithms. 
To shed light on importance of our optimization approach, we make some 
numerical experiments on a classic real data set as well as a simulated data 


set. 
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1 Introduction 


As an effective statistical tool, multiple regression is widely and increasingly 
used in econometrics, engineering, social sciences, and so on. Generally, in 
a multiple regression model, the relationship between several independent 
(predictor) variables with a dependent (response) variable is investigated. 
The model is formulated by 


y=XBrte, (1) 
where y = (y1,---,Yn)' € R” is a vector of observations on the response 
variable, X = (a1,...,%n,)' € R"*? is a matrix of observations on the 
predictor variables, 3 = (1,...,))' is a vector of unknown regression co- 
efficients, and € = (€,...,€,)| is a vector of error terms with E(e) = 0 and 


E(ee'!) = o7I,, where I, is the unit matrix of order n and o? is an unknown 
constant. The ordinary least-squares estimator (OLSE) of is given by 


B = arg min (y— XB)'(y- XB) 
= 6X y, 


where S = X'X. In regression models with intercept (0, we can simply 
rewrite (1) by considering 8 = (8, 61,...,8,)' and X = (1, X). 


Some difficulties often appear in the regression analysis, such as collinear- 
ity between explanatory variables as well as outliers’ existence in the data 
set. Generally, in the regression modeling an outlier is an observation point 
that fails to track the linear pattern of the data [10]. Outliers corrupt the 
OLSE; this fact motivated the researchers to investigate robust estimations 
[9]. As another problem in regression analysis, one may encounter with mul- 
ticollinearity in the data set that is defined as the existence of nearly linear 
dependency among columns of the design matrix X [7]. In this situation, 
the matrix S = X'X has one or more small eigenvalues which causes the 
OLSE to perform poorly [8]. One effective approach to detect the outliers in 
a data set is the least trimmed squares (LTS) [9] in which the sum of smallest 
h-squared residuals is minimized rather than the complete sum of squares. 
Here, h is a prespecified threshold value and denotes the number of normal 
or good observations that are not outliers. If z; € {0,1} is the indicator 
demonstrating whether the ith observation is ordinary (nonoutlier) or not, 
then the model can be formulated by 


tnin U(B,2) = (y — XB)" X(y- XB) 
st. z"e=h, (2) 
2, © {0,1}, i=1,2,...,n, 
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where Z is a diagonal matrix with the diagonal elements z = (21, 22,.--, Zn) 
and e = (1,...,1)" eR". 

Let A be an arbitrary nonsingular matrix. Denoted by K,(A), the p- 
norm condition number of A defined by &,(A) = ||Allp||A7*||, [12]. When 
A is positive definite and p = 2, then we get the spectral condition number, 
which can be determined as the ratio of the largest eigenvalue to the smallest 
eigenvalue of the matrix A [12]. As known, the condition number is an 
important factor to check the existence of the multicollinearity [12]. Based 
on this fact, Roozbeh, Babaie-Kafaki, and Naeimi Sadigh [8] developed an 
extension of the optimization model (2) to simultaneously control presence 
of the outliers and the multicollinearity in the data set; that is, 


min Y(8, 2) =(y— XB)" Z(y— XB) + pn(X' ZX) 


st, 2 e=h, (3) 
2, € {0,1}, i=1,2,...,n, 


in which «(-) stands for the spectral condition number and pz > 0 is called the 
penalty parameter. Here, the corresponding estimator is called the modified 
LTS counter multicollinearity (MLTSCM) estimator. In model (3), the addi- 
tional term «(X'ZX) > 0 has been embedded as a penalty for generating 
inappropriate values for 21, Z2,..., Zn and decreases the condition number of 
the final model to combat multicollinearity problem. 

Here, we deal with a modified version of the regression model (3) with 
less computational cost in the objective function. This work is organized 
as follows. In Section 2, we introduce a penalized regression method. The 
method is then combined with the LTS method in Section 3, to combat the 
sparsity of the model. In Section 4, we deal with our approximate version of 
the mixed-integer nonlinear programming model (6) using a simple estimation 
of the spectral condition number. In Section 5, we provide some numerical 
experiments to show effectiveness of our model. Finally, concluding remarks 
are presented in Section 6. 


2 Least absolute shrinkage and selection operator 
methodology 


The amount of data we are faced with keeps growing. From around the late 
1990s, wide data sets emerged, in which the number of variables far exceeds 
the number of observations. This was mainly due to our increasing ability to 
measure a large amount of information automatically [6]. 

Penalized regression can perform variable selection and prediction in a 
“Big Data” environment more effectively and efficiently in contrast to the 
other methods. Initially proposed by Tibshirani [11], the LASSO (least ab- 
solute shrinkage and selection operator) is based on minimizing mean squared 
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error, which is based on balancing the opposing factors of bias and variance 
to build the most predictive model. In fact, LASSO shrinks the regression 
coefficients toward zero by penalizing the regression model with an ¢;-norm 
penalty term, that is, the sum of the absolute value of the coefficients. LASSO 
regression is a simple technique to reduce model complexity and usually pre- 
vent over-fitting which may result from simple linear regression. 

In the case of LASSO regression, the penalty term is embedded to force 
the coefficient estimates with minor contributions to the model to be exactly 
set to zero. This means that, LASSO can be also seen as an alternative to the 
subset selection methods for performing variable selection in order to reduce 
complexity of the model. 

Ordinary least squares (OLS) regression chooses the coefficients by mini- 
mizing the residual sum of squares (RSS) as follows: 


n Pp 


min RSS = min(y — gy) (y —y) =min S- Yi — S- Lij By ; 
p ‘a an j=l 
where (%j1,..., Zip) can be called x. LASSO is an extension of the OLS, 


which adds a penalty to the RSS, being equal to sum of the absolute values of 
the nonintercept beta coefficients multiplied by the parameter that slows 
(when \ < 1) or accelerates (when \ > 1) the penalty. Therefore, the 
following optimization problem should be solved in LASSO problem: 


n 


2 
P P 

mit S- yi — >, 83 B; +rS°|B;| 
j=l j=l 


i=l 


Figure 1 shows the constraint area of the LASSO method for p = 2, in 
which elliptical contours of the function are shown by the full. They are 
centered at the OLSE. The constraint region is the rotated square. LASSO 
solution is the first place that the contours touch the square, and this will 
sometimes occur at a corner, corresponding to a zero coefficient. LASSO 
is frequently used in practice since the ¢, penalty allows us to shrink some 
coefficients to zero, that is, to produce sparse estimation models that are 
highly interpretable. 

It is notable that increasing will increase bias and decrease variance. 
Likewise, decreasing \ reduces bias and increases variance. A big part of 
the building, the best models in LASSO deal with the bias-variance tradeoff. 
Bias refers to how correct (or incorrect) the model is. A very simple model 
that makes a lot of mistakes is said to have a high bias. A very complicated 
model that does well on its training data is said to have a low bias. There are 
several ways to choose the optimal A, such as AIC, BIC, C,, and so on [10]. 
For this purpose, one of the most popular methods is the cross validation 
(CV) method. 
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RSS 


By 


Figure 1: Constraint region of LASSO. 


In order to find the optimal value of A, a range of \ values is tested and 
the optimal value is chosen using CV, which involves 


e separating the data into a training set and a test set; 
e building the model in the training set; 


e estimating the outcome in the test set using the model from the training 
set; 


e calculating mean squared error (MSE) in the test set. 


There are different types of cross validation method like Leave-P-Out, 
Leave-one-out, k-fold, standard k-fold, Monte Carlo, and so on [2]. One of 
the most important methods is the k-fold CV, which is one way to improve 
over the holdout method. The data set is divided into k subsets, and the 
holdout method is repeated k times. Each time, one of the k subsets is used 
as the test set and the other k —1 subsets are put together to form a training 
set. Then, the average error across all k trials is computed. The advantage 
of this method is that it matters less how the data gets divided. Every data 
point gets to be in a test set exactly once, and gets to be in a training set k—1 
times. The variance of the resulting estimate is reduced as k is increased. The 
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disadvantage of this method is that the training algorithm has to be rerun 
from scratch k times, which means it takes k times as much computation to 
make an evaluation. A variant of this method is to randomly divide the data 
into the test and the training set k different times. The advantage of doing 
this is that you can independently choose how large each test set is and how 
many trials you average over. 


3 Sparse least trimmed squares method 


As discussed, LASSO offers interpretable models but is not robust with re- 
spect to the outliers. The breakdown point of LASSO is + (see [1] for more 
details); that is, only one single outlier can make the LASSO estimator com- 
pletely unreliable. Therefore, robust alternatives are needed. In this situa- 
tion, Alfons, Croux, and Gelper [1] suggested the sparse LTS estimator as 


follows: 


min (8,2) = {yy — XB)'Z(y— XB) +hrAd%_, \3,l} 
st. z'e= h, 


2, € {0,1}, i=1,2,...,n, 


where A is a penalty parameter. They showed that the breakdown point of 
this estimator is 2—"+*. 


4 A penalized nonlinear mixed-integer programming 
model in linear regression 


As known, the condition number is an effective tool to check the existence 
of multicollinearity [12]. The OLSE performs poorly in the presence of mul- 
ticollinearity. Also, the existence of multicollinearity may lead to wide con- 
fidence intervals for the individual parameters or their linear combinations, 
and can produce estimators with wrong signs. 


As known, if a;,i =1,..., denotes the ith column of A, then, for any 7 
and j, it can be seen that 
Ila 
eA 27a (4) 
Ilajllp 


(see [12, Theorem 2.2.25]). Hence, we can write 


max ||aj||p 
t=1,...,n 


! ©, (A). (5) 


K OM 
of) = an lee 
Gy a eZ 
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Now, based on the above inequality, it is reasonable to select a set of h ap- 
propriate observations of the data that makes the computationally low cost 
term U,(X 'ZX) as small as possible. Hence, we propose the computation- 
ally low cost approximation k,(A) * W,(A) and then, propose the following 
approximation of model (3): 


min Y(8,z) = (y— XB)" Z(y— XB) + wU(X' ZX) 
at zle=h (6) 


d 


2, € {0,1}, i=1,2,...,n. 


Here, the corresponding estimator is called the approximate LTS counter 
multicollinearity (ALTSCM) estimator. As seen, the given optimization 
model illustrates an NP-hard mixed-integer (having both continuous (3) and 
(discrete) integer (z) variables) nonlinear programming problem for which the 
classical methods are not practically efficient [4]. Note that in complexity 
theory, NP-hardness is viewed as strong evidence that a problem is not poly- 
nomially solvable [4]. As known, metaheuristic algorithms have attracted 
special attention in developing efficiently robust computational procedures 
for solving a vast variety of such problems. Among them there is the sim- 
ulated annealing (SA) algorithm [4]. SA is a local search algorithm capable 
of escaping from local optima by use of random hill-climbing moves in the 
search process. It is very efficient in practice and well-developed in theory. 
Motivated by these, here we use the SA algorithm to approximately compute 
ALTSCM. 


5 Numerical experiments 


In this section, we investigate computational efficiency of the given estimator 
firstly on a real data set and then, on a simulated data set. 


5.1 A real data set related to the riboflavin vitamin B2 
production in Bacillus subtilis 


To illustrate usefulness of the suggested strategies, we consider the data set 
about riboflavin vitamin B2 production in Bacillus subtilis, which can be 
found in R package “hdi” [5]. Riboflavin is one of the B vitamins, which are 
all water soluble. Riboflavin is naturally present in some foods, added to some 
food products, and available as a dietary supplement. This vitamin is an es- 
sential component of two major coenzymes, flavin mononucleotide (FMN; also 
known as riboflavin-5’-phosphate), and flavin adenine dinucleotide (FAD). 
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Figure 2: LASSO plots for the riboflavin data set. 


There is a single real valued response variable, which is the logarithm of 
the riboflavin production rate. Furthermore, there are p = 4088 explanatory 
variables measuring the logarithm of the expression level of 4088 genes. There 
is also one rather homogeneous data set from n = 7 samples that were hy- 
bridized repeatedly during a fed batch fermentation process, where different 
engineered strains and strains grown under different fermentation conditions 
were analyzed. There is one rather homogeneous data set from n = 71 sam- 
ples that were hybridized repeatedly during a fed batch fermentation process, 
where different engineered strains and strains grown under different fermen- 
tation conditions were analyzed. In Figure 2, the 10-fold cross-validation and 
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Figure 3: Outlier detection plots for the riboflavin data set. 


the coefficients estimation diagrams for different values of the penalty param- 
eter are depicted. We plotted Figure 2 to find the best value of the LASSO 
parameter (\,,), which minimizes the CV criterion. As seen in Figure 2, the 
minimal mean squared error, estimated by CV, is achieved at A,, = 0.01. The 
LASSO method selects 53 variables. Figure 3 produces the diagnostic plots 
for a sequence of regression models, such as submodels along a robust sparse 
least trimmed squares regression models for a grid of values for the penalty 
parameter. In the normal Q-Q plot of the standardized residuals, a reference 
line is drawn through the first and third quartiles. The index number of ob- 
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servations with the largest distances from that line are identified by a label 
(the observation number). In the plots of the standardized residuals versus 
their index or the fitted values, horizontal reference lines are drawn at 0 and 
+2.5. The index number of observations with the largest absolute values of 
the standardized residuals is identified by a label (the observation number). 
For the regression diagnostic plot, the robust Mahalanobis distances of the 
predictor variables are computed via the minimum covariance determinant 
(MCD) based on only those predictors with nonzero coefficients. Horizontal 
reference lines are drawn at +2.5 and a vertical reference line is drawn at 
the upper 97.5% quantile of the chi-squared distribution with p degrees of 
freedom, where p denotes the number of predictors with nonzero coefficients. 
The index number of observations with the largest absolute values of the 
standardized residuals and/or largest robust Mahalanobis distances are iden- 
tified by a label (the observation number). According to Figure 3, it is clear 
that there exist some outliers in the data and so, it is necessary to use the 
robust methods same as the sparse LTS and ALTSCM methods for modeling 
the data. 


We reported the results for the three methods listed in Table 1, in which 
we numerically calculated RSS = (y — 9) '(y — g) and R? = 1 — SSE/S,, 
with y = XB and Sy, = 7), 2i(y; — 9). They are measures for the error 
and goodness of prediction, respectively. It is clear that the penalized mixed- 
integer method performs better than the other methods according to the 
goodness of fit criteria. 


5.2 A simulated data analysis 


To examine the performance of the proposed estimators, we perform a sim- 
ulation data study. For this purpose, the following model is considered with 
n = 55, p = 450, and h = 41 (see [3] for more details): 


y=XBt+e, 
where 
fied. 
B= (B;, 82)", By = (—1.5, 2, 2.5, 4, —3, ye Bo (444x1) ~ N(0, 1), 


£1,..., 8450 ~ Naso(1aso, L450), 
4.i.d. 


€ = (€1,€2)", €1 (nx1) YN (0,1), €2 (n—n)x1) ~ t2(8), 
where t,,(6) is the noncentral t-student distribution with m degrees of free- 
dom and noncentrality parameter 0. 


We produced the first h and the last n—h error terms as random variables 
from dependent normal and independent noncentral t-student distributions, 
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Method LASSO Sparse LTS ALTSCM 
Effective genes Estimation Effective genes Estimation Effective genes Estimation 
Intercept 0.6712 Intercept .3093 Intercept = 
ARGF_at -0.1911 CHED_at -0.0895 ABH_at 0.0124 
DNAJ_at -0.1364 CSBA_at -0546 ALD_at 0.0124 
GAPB_at 0.0380 CT AA_at .0336 AMY C_at 0.0205 
LY SC_at -0.2977 DEAD_at -0.1260 ARGB_at -0.0132 
PKSA_at 0.0236 MREBH_at -0.0819 ARGF_at -0.0177 
PRIA_at 0.0982 SPOIISB_at .0248 ARGG_at -0.0133 
SPOIIAA_at 0.0224 THIK_at -0.2799 ARGH _at -0.0160 
SPOVAA_at 0.2652 XKDI_at .0001 BIOB_at -0.0133 
THIK_at -0.0051 XKDS_at .0988 CARA_at -0.0140 
XHLB_at 0.1608 YCDH_at -0.0018 CARB_at -0.0145 
YACN_at -0.0404 YCGM_at -0.0257 GAPB_at 0.0130 
YBFI_at 0.1329 YCGP_at -0.0168 GSIB_at -0.0129 
YCDH_at -0.0067 YCKJ_at -0.0761 NDK_at -0.0129 
YCGO_at -0.0057 Y DAR_at -0.0175 PAND_at -0.0128 
YCKE_at 0.0092 YDBM_at -0.3054 PBUX_at 0.0140 
YCLB_at 0.1994 YFLL_at -0.0542 PHRI_r_at -0.0128 
YCLF_at -0.0533 YHCB_at -0.1021 PTSG_at -0.0120 
YDDH_at -0.0176 YHCL_at -0.0850 SIGY_at -0.0155 
YDDK_at -0.1142 YHDO_at -0.0762 SSPA_at -0.0122 
Y EBC_at -0.5347 YJBJ_at -0.1005 Y BGB_at -0.0139 
YFHE_r_at 0.1451 YJCJ_at 0.2446 YCDH_at -0.0307 
YFII_at 0.0100 Y KUH_at 0.2855 YCGM_at -0.0183 
YFIO_at 0.1588 YORB_i_at 0.0519 YCGN_at -0.0206 
YFIR_at 0.0441 YQET_at -0.0217 YCGO_at -0.0207 
YHDS_r_at 0.1452 YRZI_r_at 0.0109 YCIC_at -0.0229 
YKBA_at 0.1108 YTGB_at -0.0203 YDDH_at -0.0131 
YKVJ_at 0.0237 YUIA_at 0.0113 YFMH_r_at  -0.0164 
YLXW_at 0.0731 YVBY_at 0.0420 YHDS_r_at 0.0166 
YMFE_at 0.0183 YWQD_at -0.0098 YHFH_r_at 0.0178 
YOAB_at -0.8123 YXEL_at -0.0442 YHZA_at -0.0396 
YPGA_at -0.0102 YXLD_at -0.0389 YOEBat -0.0122 
YQJT_at 0.0415 YXLE_at -0.1714 YOPSat 0.0149 
YQJU_at 0.2320 YOQX_i_at 0.0155 
YRVJ_at -0.0547 Y PUD_at 0.0349 
YTGB_at -0.0390 Y PUF_at 0.0248 
YUID_at 0.0134 Y PUG_at 0.0208 
YURQ_at 0.0245 YRPE_at -0.0150 
YXLD_at -0.2005 YRZI_r_at 0.0173 
YXLE_at -0.1068 YTGA_at -0.0168 
YY BG_at -0.0781 YTGB_at -0.0191 
YYDA_at -0.1042 YTGC_at -0.0129 
YTGD_at -0.0198 
YTIA_at -0.0290 
YUSA_at -0.0126 
YXLC_at -0.0204 
YXLD_at -0.0270 
YXLE_at -0.0270 
YXLF_at -0.0194 
YXLG_at -0.0216 
RSS 594.0313 25.6582 0.1487 
R? 0.6519 0.9093 0.9458 


Table 1: Results of the proposed estimators for the riboflavin data set 
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respectively. Such data generating method makes the outliers to appear on 
one side of the true regression hyperplane, corrupting the nonrobust estima- 
tors by tending them to the outliers. 


In Table 2, we have reported the results. To save space, the estimations of 
nonzero coefficients are just reported in Table 2. Also, the outliers are shown 
in Figure 4. As seen, ALTSCM is again preferable to the other methods. 
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Figure 4: Outlier detection plots for the simulated data set. 
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Method LASSO Sparse LTS ALTSCM 
Parameters 

By 0.00000 0.00000 -1.44790 
Be 0.00000 0.00000 1.98652 
B3 2.08386 2.69022 2.48500 
Ba 3.83184 3.47647 4.01102 
Bs -3.01472 -2.03737 -2.99656 
Be 2.27159 2.87485 4.94529 
RSS 4203.47 2850.05 4.71450 
R? 0.14287 0.41885 0.93325 


Table 2: Results of the proposed estimators for the simulated data set 


6 Conclusions 


We dealt with a computationally low cost nonlinear mixed-integer optimiza- 
tion model to combat both the outliers and multicollinearity effects in the 
high dimensional regression. Suggesting an approximation for the condition 
number, our model also has a simple (LTS-based) structure. We used the 
simulated annealing algorithm to solve the model effectively. Computational 
tests showed that the given estimator (ALTSCM) is practically promising. 
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